1 Introduction

The acorde R package contains the necessary functions to reproduce the pipeline in this paper, a study by Arzalluz-Luque et al. in which we analyze networks of isoform co-usage using single-cell RNA-seq data (scRNA-seq).

The pipeline includes three basic analysis blocks:

  1. Single-cell isoform quantification and filtering. First, bulk long read data is used to generate tissue-specific transcript models. Short-read scRNA-seq data is then used for isoform quantification, and isoforms are filtered according to their Differential Expression (DE) status across multiple cell types.

  2. Detection of isoform co-expression. acorde includes the implementation of percentile correlations, a novel strategy to obtain noise-robust correlation estimates from scRNA-Seq data, and a semi-automated clustering approach to detect modules of co-expressed isoforms acorss cell types.

  3. Differential and co-Differential Isoform Usage analysis. DIU and co-DIU analysis are designed to leverage the multiple cell types contained in single-cell datasets, and enable the detection of genes that show isoform expression coordination. To couple these analysis with a biologically interpretable readout, we incorporate functional annotations onto isoform models, and use tappAS for functional analysis.

Since both the long read-transcriptome definition procedure and the functional analyses in [1] are based on external tools, the present R package does not incorporate neither of these two analysis steps. Instead, acorde contains the necessary functions and documentation to obtain a set of DIU and co-DIU genes using an single-cell, isoform-level expression matrix as input.

In addition, we provide all the necessary instructions to reproduce the figures and additional analyses included in Arzalluz-Luque et al. [1], and provide the isoform expression matrix employed during the study as internal data in the package.

2 Installation

Acorde can be installed from GitHub using devtools:

install.packages("devtools")
devtools::install_github("ConesaLab/acorde", build_vignettes = TRUE)

3 Getting ready

To run the analyses in this vignette, you’ll first need to load acorde:

# load acorde
library(acorde)

# load auxiliary packages
suppressPackageStartupMessages({
  library(dplyr)
  library(tibble)
  library(purrr)
  library(furrr)
  library(ggplot2)
  library(SingleCellExperiment)
})

In addition, we’ll require some additional packages for data handling and formatting. Most of them are signaled as acorde dependencies, so they will already be installed in your system.

To generate plots, we make use of the cowplot R package and the cowplot theme. After install:

install.packages("cowplot")

…you can load and set the theme of your R session as follows:

library(cowplot)
theme_set(theme_cowplot())

4 Input data

The acorde pipeline requires a single-cell isoform expression matrix as input. Single-cell isoform counts should be provided in the form of a data.frame or tibble object including isoforms as rows and cells as columns. Isoform identifiers can be supplied as rownames() or as an additional identifier column, as required by tibble.

To generate an isoform-level single-cell expression matrix, we first processed long read bulk data from ENCODE (provided by Wyman et al. [2]) to build a mouse neural transcriptome, and then used publicly-available scRNA-seq data by Tasic et al. [3] to quantify the expression of long read-defined isoforms in mouse neural cell types. Details to this process can be found in our manuscript (see Supplementary Note and Methods).

If you wish to reproduce the analyses in Arzalluz-Luque et al. [1]), you can load the tasic object to use our isoform-quantified dataset:

# load Tasic dataset
data("tasic")

# load Tasic metadata
data("metadata")

# use metadata to create a cell - cell type identity table
id_table <- metadata %>%
    select(run, cell_type) %>%
    dplyr::rename(cell = "run")

These contain two tibble objects. After quality control (see Methods in Arzalluz-Luque et al. [1]), the tasic tibble contains expression data for 16240 isoforms and 1591 cells belonging to 7 neural cell types:

Cell type Number of cells
Astrocyte 43
Endothelial Cell 29
GABA-ergic Neuron 729
Glutamatergic Neuron 711
Microglia 21
Oligodendrocyte 37
Oligodendrocyte Precursor Cell 21
# display format of tasic
tasic[1:6, 1:8]
#> # A tibble: 6 × 8
#>   transcript   SRR2138606 SRR2138607 SRR2138609 SRR2138610 SRR2138612 SRR2138614
#>   <chr>             <dbl>      <dbl>      <dbl>      <dbl>      <dbl>      <dbl>
#> 1 NM_00100200…      0           2.21      7.74        46.5       12.6      0    
#> 2 XM_00653431…      0           0         0            0         36.8      0.878
#> 3 NM_00103330…      0          27.8    1195.         693.        16.0     28.6  
#> 4 NM_00134651…    324.          0         0            0          0        0    
#> 5 NM_00104010…      0          14.7       0.378       54.1        0       23.1  
#> 6 XM_01732159…      0.362       0         0         2013.       343.     270.   
#> # … with 1 more variable: SRR2138615 <dbl>

# number of cells and isoforms
dim(tasic)
#> [1] 16240  1592

Metadata is contained in the metadata tibble. This table was generated using Tasic et al. supplementary files, which were used to parse cell type labels for single-cell IDs (i.e. sequencing run IDs, included in the run column), among other information:

# show information contained in metadata
metadata %>% colnames
#>  [1] "run"            "GEO_accession"  "passes_QC"      "input_material"
#>  [5] "mouse_line"     "dissection"     "animal_ID"      "sample_name"   
#>  [9] "cell_type"      "subtype"        "cell_type.mod"

# display cell type labels
unique(metadata$cell_type)
#> [1] "Glutamatergic Neuron"           "GABA-ergic Neuron"             
#> [3] "Endothelial Cell"               "Astrocyte"                     
#> [5] "Microglia"                      "Oligodendrocyte"               
#> [7] "Oligodendrocyte Precursor Cell"

See ?tasic and ?metadata for details.

5 Isoform Differential Expression across multiple cell types

To select isoforms with robust co-variation across the 7 neural cell types, we first applied multi-group Differential Expression analysis, which will detect isoforms that are differentially expressed (DE) in at least one cell type.

To achieve this, we combined the zero-weighting strategy in the zinbwave R package with bulk-designed DE methods DESeq2 and edgeR. Both tools were set to detect DE across multiple groups. The incorporation of weights to these analyses and the correct application of both tools to scRNA-seq data was done following the Differential Expression section in the zinbwave vignette.

Acorde provides the cell_type_DE() function, which constitutes a wrapper to these two methodologies. This function takes a SingleCellExperiment object as input:

# convert tibble to count matrix
count.matrix <- column_to_rownames(tasic, "transcript") %>% 
  as.matrix
# round estimated counts from RSEM to generate integer values
count.matrix <- count.matrix %>% round

# create SingleCellExperiment object with Tasic data
sce <- SingleCellExperiment(assays = list(counts = count.matrix, 
                                          logcounts = log2(count.matrix + 1)),
                            colData = metadata)

By default, cell_type_DE() automatically calculates and stores zinbwave weights in the weights slot of the SingleCellExperiment object. Alternatively, you may set cell_type_DE(compute_weights = FALSE) and run zinbwave() yourself (note that computing weights is a computationally costly step, so we’ll use the BiocParallel R package to parallelize the process):

# load biocParallel
library(BiocParallel)

# compute weights
library(zinbwave)

sce <- zinbwave(sce, observationalWeights = TRUE, 
                BPPARAM = MulticoreParam(6))

Now we are ready to run isoform-level Differential Expression analysis. Set cell_type_DE(method = "both") to be able to compare multi-group DE results for both edgeR and DESeq2, or choose either "edgeR" or "DESeq2" to run just one DE analysis.

# run DE analysis using both DESeq2 and edgeR
de_results <- cell_type_DE(sce, AdjPvalue = 0.05, 
                           mode = "both", 
                           compute_weights = FALSE)

In the acorde manuscript, a downsampling strategy was used to balance cell type abundances between neural and glial cell types. Briefly, 50 runs of random sampling were performed to select 45 GABA and 45 glutamatergic neuron cells, and both edgeR and DESeq2 were run to obtain DE isoforms using each of the 50 downsampled versions of the data. Next, isoforms detected to be significantly DE by at least one of the methods in >50% of the runs were considered to be DE.

We hereby provide the DE consensus set that was used in the acorde paper to fully ensure the reproducibility of our results. From here on out, this vignette will display the results of using this DE isoform set:

# load consensus set of DE isoforms
data("consensus_DE_set")

# filter expression matrix
tasic_de <- tasic %>%
  filter(transcript %in% consensus_DE_set$transcript)

5.1 Other isoform filtering criteria

In addition to DE filtering, several functions are provided in acorde to control for other expression-related biases and remove isoforms prior to downstream analysis (or during quality control):

  • detect_sparse() flags isoforms that have high proportion of zeros in all cell types. Isoforms must have non-zero expression in a proportion higher than the supplied threshold in at least one cell type.
  • detect_minor_isoforms() flags isoforms with low gene-relative expression across cell types. Isoforms that represent a small proportion of the total gene’s expression will be flagged as minor.
  • detect_low_expression() flags isoforms with low mean/median counts across cell types.

For our manuscript, we removed minor isoforms using a 10% gene-relative expression threshold:

# detect minor isoforms
minor <- detect_minor_isoforms(tasic_de, id_table = id_table,
                               gene_tr_table = gene_tr_ID,
                               gene_expr_proportion = 0.1,
                               isoform_col = "transcript")

head(minor)
#> # A tibble: 6 × 2
#>   transcript     minor_isoform
#>   <chr>          <lgl>        
#> 1 NM_001001980.3 FALSE        
#> 2 NM_001004367.4 FALSE        
#> 3 NM_001012623.2 FALSE        
#> 4 NM_001012625.2 FALSE        
#> 5 NM_001015507.2 FALSE        
#> 6 NM_001017426.2 FALSE

# summary of minor isoforms detected
table(minor$minor_isoform)
#> 
#> FALSE  TRUE 
#>  8306   969

# get isoforms that were not flagged as minor
excl_minor <- filter(minor, minor_isoform == FALSE)

# apply filter
tasic_sp <- tasic_de %>% 
  filter(transcript %in% excl_minor$transcript)

Next, isoforms were removed if they belonged to genes with a single DE isoform since, in practice, no differential splicing can occur for single-isoform genes (or for genes for which only one isoform presents expression variation across cell types). Isoform - gene correspondence is provided by acorde in the gene_tr_ID data object.

# load and display gene-isoform table
data("gene_tr_ID")
head(gene_tr_ID)
#> # A tibble: 6 × 2
#>   transcript     gene 
#>   <chr>          <chr>
#> 1 NM_001012623.2 Rims1
#> 2 NM_001012625.2 Rims1
#> 3 NM_001083121.2 Enah 
#> 4 NM_001199003.1 Rgs7 
#> 5 NM_001311166.1 Edem3
#> 6 NM_001347195.1 Rgs7

# remove transcripts from single-isoform genes
gspliced <- tasic_sp %>%
  select(transcript) %>%
  left_join(gene_tr_ID, by = "transcript") %>%
  group_by(gene) %>%
  filter(n() > 1)

tasic_sp <- tasic_sp %>%
  filter(transcript %in% gspliced$transcript)

As a summary, here’s a comparison of the number of isoforms remaining in our expression matrix after performing these filtering steps:

Object Content Isoform_no
tasic All isoforms 16240
tasic_de DE isoforms 9275
tasic_sp DE isoforms from multi-isoform genes 6794

6 Computing isoform co-expression using percentile correlations

To detect isoform co-expression across the 7 neural cell types in the Tasic dataset, we will apply percentile correlations. Percentile correlations, as described in [1]), are a metric designed to overcome cell-to-cell effects that generate noise and mask the co-expression signal in the data, yielding low correlation values when using traditional correlation metrics.

Instead, percentile correlations are based on a percentile-summarizing strategy in which cells of the same cell-type are used to estimate a cell type-specific expression distribution for each isoform and cell-level counts are replaced by percentile values. Then, Pearson correlations between isoforms are computed using this percentile-summarized expression. More details can be found in Arzalluz-Luque et al. [1]) and in the figure below:

Acorde includes the percentile_cor() function to compute percentile correlations, which takes a cell - cell-type correspondence table (id_table defined above) and the expression matrix as input, and generates an isoform-to-isoform correlation matrix:

# compute percentile correlations
cors <- percentile_cor(tasic_sp, 
                       id_table = id_table, 
                       percentile_no = 10, 
                       isoform_col = "transcript")

cors[1:4, 1:4]
#>                NM_001033304.2 NM_001346518.1 NM_001040106.2 XM_017321597.2
#> NM_001033304.2      1.0000000      0.8659368      0.5790838      0.8680737
#> NM_001346518.1      0.8659368      1.0000000      0.7007998      0.9458788
#> NM_001040106.2      0.5790838      0.7007998      1.0000000      0.7615355
#> XM_017321597.2      0.8680737      0.9458788      0.7615355      1.0000000

By default, percentile_cor() summarizes isoform expression into 10 percentile values (deciles) per cell type, although users may supply any number between 4 (quantiles) and 100 (percentiles). If a tibble is supplied, users will need to specify the column name in which isoform identifiers are provided in order for percentile_cor() to successfully return isoform IDs as column and row names in the correlation matrix.

7 Semi-supervised isoform clustering

The correlation matrix generated by percentile_cor() can be used to detect groups of isoforms with similar expression patterns across and within cell types, given that percentile correlation captures not only the similarities in expression patterns among the cell types, but also the agreement in expression “behavior” of the isoforms in each of the cell types.

Acorde includes a series of functions for clustering and cluster refinement that, when combined, provide a flexible framework to obtain modules of co-expressed isoforms. Initial clustering is based on the cutreeHybrid() function from the dynamicTreeCut package. dynamicTreeCut is a hierarchical clustering algorithm based on the
selection of optimal cut heights for different branches of the dendrogram, instead of applying the same fixed threshold to separate elements into clusters.

7.1 Initial dynamic clustering with cluster_isoforms()

We will first run the cluster_isoforms() wrapper function, which takes a correlation matrix, generates the necessary inptus and runs cutreeHybrid under the hood:

clusters <- cluster_isoforms(cors, deepSplit = 4, pamStage = FALSE, 
                             minClusterSize = 20)
#> Inferring dendrogram via hclust()...
#> Creating clusters dynamically via cutreeHybrid()...
#>  ..cutHeight not given, setting it to 0.551  ===>  99% of the (truncated) height range in dendro.
#>  ..done.

# show number of clusters
length(clusters)
#> [1] 166

Briefly, deepSplit ranges between 0 and 4, and provides smaller, more accurate clusters when set to high values. Setting pamStage = FALSE allows return of unassigned items, which are placed on the first element of the clusters list. Finally, minClusterSize determines the minimum size of the produced clusters.

In our study, we set these parameters in order to maximize the similarity between isoforms assigned to the same cluster, regardless of the high number of clusters obtained. This configuration was selected because acorde includes a series of steps in the pipeline to refine and merge some of these clusters. These are designed to improve the expression signal while minimizing redundancies in the expression profile that they represent. However, if users want to run clustering using their own parameter setup, cluster_isoforms() can pass any additional parameters supplied to cutreeHybrid().

In spite of being more flexible than regular hierarchical clustering, the dynamicTreeCut algorithm can also generate inconsistent isoform assignments to clusters, i.e. group isoforms with rather different expression profiles.

We’ll now use two of the cluster visualization functions in acorde to view cluster 6 in clusters (clusters[[7]], given that the first element of the list corresponds to unclustered isoforms) as an example of this. Acorde provides a function, calculate_cluster_ctmeans(), to compute the mean and standard error of cell type expression for each of the isoforms in a cluster. In this manner, the similarities of the expression profiles across cell types can be easily compared for same-cluster isoforms. The output of calculate_cluster_ctmeans can be directly provided to plot_cluster_ctmeans() to generate a visual summary of all isoforms in a cluster:

# scale isoform expression
tasic_scaled <- scale_isoforms(tasic_sp, method = "classic", 
                               isoform_col = "transcript")

# calculate cell type mean expression for all isoforms
example_means <- calculate_cluster_ctmeans(tasic_scaled,
                                           isoform_ids = clusters[[7]], 
                                           id_table = id_table,
                                           isoform_col = "transcript")

# plot isoform-level means for all isoforms in the cluster
ctlabs <- c("Astr", "End", "GABA", "Glut", "Micro", "Oligo", "OPC")

plot_cluster_ctmeans(example_means, ct_labels = ctlabs)

7.2 Cluster filtering with filter_clusters()

Some of the isoforms in cluster 6 may have an expression pattern that is slightly different to the rest of the members of the cluster. To solve this, we can use the filter_clusters() function, which will move isoforms to the unclustered groups if they are poorly correlated with most of the isoforms in the cluster.

# see current number of unclustered isoforms
clusters[[1]] %>% length
#> [1] 634

# run filter_clusters
clusters_filt <- filter_clusters(clusters, cor_matrix = cors,
                                 min_cor = 0.9, lowcor_threshold = 2,
                                 contains_unclustered = TRUE,
                                 size_filter = TRUE, size_threshold = 10)

# see number of unclustered isoforms after filtering
clusters_filt[[1]] %>% length
#> [1] 4699

# number of isoforms remaining in clusters
clusters_filt[2:length(clusters_filt)] %>% map_int(length) %>% sum
#> [1] 2095

In this step, isoforms will be removed from a cluster if they have correlation values below min_cor with other members of the cluster. lowcor_threshold provides the maximum number of correlation values lower than min_cor that are allowed per isoform. In addition, size_filter and size_threshold can be used to discard clusters by their size, moving isoforms in clusters that are too small to the unclustered group.

As a result, many of the isoforms that were initially input for clustering are currently not assigned, and we have successfully cleaned the signal of our initial set of clusters. cluster 6 (now in position 6 of the clusters_filt list due to removal of clusters below the size threshold) now looks like this:

# calculate cell type mean expression for all isoforms
example_means.filt <- calculate_cluster_ctmeans(tasic_scaled,
                                           isoform_ids = clusters_filt[[6]], 
                                           id_table = id_table,
                                           isoform_col = "transcript")

# plot isoform-level means for all isoforms in the cluster
plot_cluster_ctmeans(example_means.filt, ct_labels = ctlabs)

7.3 Assigning unclustered isoforms to clusters with expand_clusters()

Next, we will use the expand_clusters() function in acorde to join unclustered isoforms to their most similar cluster. In this process, each cluster’s profile is first summarized into a synthetic representative transcript that we named metatranscript. Metatranscripts are calculated as the mean of the percentile-summarized expression of all isoforms in the cluster. Then, the function computes percentile correlations between isoforms and cluster metatranscripts.

In our study, we assigned unclustered isoforms to a cluster if they showed correlation > 0.9 with its metatranscript (and the maximally correlated cluster was selected as the best match if there were ties).

# first round, expand using hard correlation threshold
clusters_expanded <- expand_clusters(tasic_sp, isoform_col = "transcript", 
                                     id_table = id_table,
                                     cluster_list = clusters_filt[2:length(clusters_filt)],
                                     unclustered = clusters_filt[[1]],
                                     force_expand = FALSE, 
                                     expand_threshold = 0.9,
                                     method = "percentile")

# show output format
names(clusters_expanded)
#> [1] "unclustered" "expanded"
map_chr(clusters_expanded, class)
#> unclustered    expanded 
#> "character"      "list"
map_int(clusters_expanded, length)
#> unclustered    expanded 
#>        1942          67

# check number of unclustered isoforms after first round
length(clusters_expanded$unclustered)
#> [1] 1942

The correlation threshold used to assign an isoform to a cluster can be adjusted via the expand_threshold parameter. To simplify, users may set force_expand = TRUE to assign isoforms to the cluster reporting the highest correlation. In this case, expand_threshold will be ignored.

We can now check the effect of cluster expansion on our example cluster, cluster 6. Note that after expansion, unclustered isoforms are now assigned to the unclustered list element, while the list containing the clusters is situated in the expanded slot. Therefore, cluster 6 is now clusters_expanded$expanded[[5]]:

# compare cluster sizes
  # before expansion
  clusters_filt[[6]] %>% length
#> [1] 26

  # after expansion
  example_expanded <- clusters_expanded$expanded[[5]]
  example_expanded %>% length
#> [1] 82


# calculate cell type mean expression for all isoforms in cluster
example_means.exp <- calculate_cluster_ctmeans(tasic_scaled,
                                               isoform_ids = example_expanded, 
                                               id_table = id_table,
                                               isoform_col = "transcript")

# plot isoform-level means for all isoforms in the cluster
plot_cluster_ctmeans(example_means.exp, ct_labels = ctlabs)

7.4 Eliminating redundancies across cluster profiles with merge_clusters()

At this point, we have focused on the reduction of within-cluster variability, which results in a large number of small, redundant clusters. To mitigate this, acorde includes the merge_clusters() function, which detects expression profile similarities across clusters (redundancy) and merges them into a single cluster. Briefly, merge_clusters employs a clustering approach using the metatranscripts for computed clusters as input, which is done via regular hierarchical clustering (by default) or using the dynamic approach implemented in the cutreeHybrid() function from the dynamicTreeCut package (when dynamic = TRUE is set).

For our study, we run regular hierarchical clustering of metatranscripts via the dynamic = FALSE parameter. When set to TRUE, this function can perform dynamic clustering for cluster merge and passes arguments to cutreeHybrid(). In this case, however, we just set height_cutoff = 0.1 as non-default parameters for merge_clusters() to pass on to R stats function cutree():

merge.output <- merge_clusters(tasic_sp, id_table = id_table,
                               cluster_list = clusters_expanded$expanded,
                               method = "percentile",
                               dynamic = FALSE,
                               height_cutoff = 0.1,
                               isoform_col = "transcript")

merge_clusters() returns a nested list including two elements: first, a list containing the merged cluster indices, allowing traceback of all merged decisions; and second, a list containing the merged clusters obtained as a result.

# show output format
map_chr(merge.output, class)
#> merged_groups      clusters 
#>        "list"        "list"
map_int(merge.output, length)
#> merged_groups      clusters 
#>            26            26

# retrieve outputs
merged_groups <- merge.output[[1]]
clusters_merged <- merge.output[[2]]

# show merge decision tree
head(merged_groups)
#> $`1`
#> [1]  1  2  6 21
#> 
#> $`2`
#> [1]  3  4  8 27 34 36
#> 
#> $`3`
#> [1]  5 20
#> 
#> $`4`
#> [1] 7
#> 
#> $`5`
#> [1]  9 13 44 51
#> 
#> $`6`
#> [1] 10 15

# show merged cluster formats
map(head(clusters_merged), head, 5)
#> $`1`
#>               11               12               13               14 
#> "NM_001346518.1" "XM_006514310.3" "XR_001778696.2" "XM_006538935.2" 
#>               15 
#> "XM_030253589.1" 
#> 
#> $`2`
#>               31               32               33               34 
#> "NM_001033304.2" "XM_006512556.2"    "NR_027907.1"      "PB.1034.1" 
#>               35 
#>     "PB.10636.2" 
#> 
#> $`3`
#>               51               52               53               54 
#>    "XR_386367.2" "XM_011241643.3"     "PB.10772.5"     "PB.11344.3" 
#>               55 
#>     "PB.11848.9" 
#> 
#> $`4`
#>            71            72            73            74            75 
#>  "PB.10330.2" "PB.11604.12" "PB.11604.28"  "PB.13168.1"  "PB.13169.1" 
#> 
#> $`5`
#>           91           92           93           94           95 
#> "PB.10081.2" "PB.10599.3" "PB.11122.2" "PB.11848.3" "PB.1315.13" 
#> 
#> $`6`
#>          101          102          103          104          105 
#> "PB.11505.3" "PB.1155.17" "PB.11784.4" "PB.14081.8" "PB.14876.5"

After merge, the number of clusters has been greatly reduced. We can now check the results of any of the cluster merge decisions made by merge_clusters to verify that they are correct. Let’s take, for instance, merged cluster 3:

# plot an example of clusters that have been merged together
example_group <- merged_groups[[3]]

example_group
#> [1]  5 20

# compute cell type mean expression for merged clusters
merge_check <- clusters_expanded$expanded[example_group] %>% 
  map(~calculate_cluster_ctmeans(tasic_scaled,
                                 isoform_ids = .,
                                 id_table = id_table,
                                 isoform_col = "transcript"))

# create plots and plot grid
merge_check_plots <- seq(1, length(merge_check)) %>% 
  map(~plot_cluster_ctmeans(merge_check[[.]], 
                            plot_title = paste("Cluster", example_group[.]), 
                            ct_labels = ctlabs))

plot_grid(plotlist = merge_check_plots)

To enable a more summarized and elegant view of cluster profiles than that generated by plot_cluster_ctmeans(), acorde also includes a two functions, calculate_cluster_profile() and plot_cluster_profile(). These will average cell type mean values for all transcripts and compute the standard deviation. In this manner, we can better evaluate similarities among cluster members, and see whether the patterns represented by clusters are truly distinct or still contain redundancies.

# compute cluster mean and standard deviations
patterns_merged <- map(clusters_merged,
                       ~calculate_cluster_profile(tasic_scaled,
                                   isoform_ids = .,
                                   id_table = id_table,
                                   isoform_col = "transcript"))

patterns_merged[[1]]
#> # A tibble: 7 × 4
#>   tr            mean    sd cell_type                     
#>   <chr>        <dbl> <dbl> <chr>                         
#> 1 silhouette -0.402  0.175 Astrocyte                     
#> 2 silhouette -0.422  0.168 Endothelial Cell              
#> 3 silhouette  0.0161 0.127 GABA-ergic Neuron             
#> 4 silhouette  0.0714 0.131 Glutamatergic Neuron          
#> 5 silhouette -0.422  0.182 Microglia                     
#> 6 silhouette -0.419  0.170 Oligodendrocyte               
#> 7 silhouette -0.410  0.172 Oligodendrocyte Precursor Cell

# generate plots using output from calculate_cluster_profile()
pattern_plots_merged <- map(patterns_merged, plot_cluster_profile,
                                ct_labels = ctlabs)

# plot a grid with all clusters
plot_grid(plotlist = pattern_plots_merged,
              labels = seq(1, length(pattern_plots_merged)), ncol = 4)

8 Additional steps to improve clustering results

As shown above, clustering of metatranscripts generally works well for removing redundancies. However, clustering results can be improved by further combining acorde functions. For our manuscript, we performed a series of cluster refinement steps to assign isoforms that remained unclustered and generate fewer, more accurate clusters.

8.1 Clusters with noisy profiles

First, among the clusters generated as a result of unsupervised clustering, there are a few that show strong variability in comparison to the global cluster mean, given the broad standard deviation ribbon in some of the panels in the plot above above. In particular, we tackled clusters 7, 12 and 14. Note that, for cases where clusters were noisy but represented a unique profile (i.e. an expression pattern across cell types not captured by any other cluster, for instance, cluster 26), we did not perform this step to avoid removing the expression pattern entirely.

To mitigate this, we decided to join isoforms in clusters that had a visibly inaccurate (i.e. noisy) profile to those that remained unclustered:

# select noisy clusters
noisy_idx <- c(7, 12, 14)

# join all non-assigned isoforms into one list
unclustered <- c(clusters_expanded$unclustered, 
                 unlist(clusters_merged[noisy_idx]))

# remove noisy clusters from clusters_merged list 
clusters_merged <- clusters_merged[-noisy_idx]

length(unclustered)
#> [1] 2381

8.2 Refinement of merged clusters

Merging clusters may create discrepancies between same-cluster isoforms, since they were originally assigned to different clusters. To make clusters more homogeneous for downstream assignment of unclustered isoforms, we run filter_clusters() again with the following parameters:

# run filter
clusters_merged.filtered <- filter_clusters(clusters_merged, cor_matrix = cors,
                                            min_cor = 0.7, lowcor_threshold = 1,
                                            contains_unclustered = FALSE,
                                            size_filter = TRUE,
                                            size_threshold = 10)

# join unclustered to the rest of isoforms removed by the filter
unclustered <- c(clusters_merged.filtered[[1]],
                 unclustered)

# view total no. of unclustered isoforms
length(unclustered)
#> [1] 3039

# check total no. of clusters (first slot contains only unclustered)
length(clusters_merged.filtered)
#> [1] 24

Note that clusters containing less than 10 isoforms (as defined by the size_threshold) parameter will be discarded and their isoforms moved to the unclustered group.

8.3 Assignment of remaining unclustered isoforms

We next assigned all unclustered isoforms to the remaining 23 clusters. Given that there is a high number of isoforms with no cluster assigned, we used a recursive assignment strategy in which the correlation threshold used for grouping decreased after each iteration. In this manner, isoforms with high similarities with the cluster profile will be assigned first, contributing to strengthen the profile and helping drive the assignment in the next iteration.

# define threshold vector
thres <- c(0.9, 0.8, 0.7)

clusters_merged.expanded <- list()
clusters_merged.expanded$unclustered <- unclustered
clusters_merged.expanded$expanded <- clusters_merged.filtered[2:length(clusters_merged.filtered)]

for(t in thres){
  clusters_merged.expanded <- expand_clusters(tasic_sp, 
                                      id_table = id_table,
                                      cluster_list = clusters_merged.expanded$expanded,
                                      unclustered = clusters_merged.expanded$unclustered,
                                      force_expand = FALSE,
                                      expand_threshold = t,
                                      method = "percentile",
                                      isoform_col = "transcript")
}
#> Expanding clusters using  percentile correlation...
#> New names:
#> * mean_percentile -> mean_percentile...1
#> * mean_percentile -> mean_percentile...2
#> * mean_percentile -> mean_percentile...3
#> * mean_percentile -> mean_percentile...4
#> * mean_percentile -> mean_percentile...5
#> * ...
#> Expanding clusters using  percentile correlation...
#> New names:
#> * mean_percentile -> mean_percentile...1
#> * mean_percentile -> mean_percentile...2
#> * mean_percentile -> mean_percentile...3
#> * mean_percentile -> mean_percentile...4
#> * mean_percentile -> mean_percentile...5
#> * ...
#> Expanding clusters using  percentile correlation...
#> New names:
#> * mean_percentile -> mean_percentile...1
#> * mean_percentile -> mean_percentile...2
#> * mean_percentile -> mean_percentile...3
#> * mean_percentile -> mean_percentile...4
#> * mean_percentile -> mean_percentile...5
#> * ...

# remaining unclustered
length(clusters_merged.expanded$unclustered)
#> [1] 144

For the remaining isoforms, we set force_expand = TRUE to assign them to the most similar cluster (i.e. the one exhibiting the highest percentile correlation with each isoform’s expression).

clusters_merged.expanded <- expand_clusters(tasic_sp, id_table = id_table,
                                            cluster_list = clusters_merged.expanded$expanded,
                                            unclustered = clusters_merged.expanded$unclustered,
                                            force_expand = TRUE,
                                            method = "percentile",
                                            isoform_col = "transcript")
#> Expanding clusters using  percentile correlation...
#> New names:
#> * mean_percentile -> mean_percentile...1
#> * mean_percentile -> mean_percentile...2
#> * mean_percentile -> mean_percentile...3
#> * mean_percentile -> mean_percentile...4
#> * mean_percentile -> mean_percentile...5
#> * ...

8.4 Final merge of redundant profiles

Our parameter choice for the initial cluster merge was rather lenient, meaning that it was adjusted to sacrifice some potentially correct merge decisions in order to avoid others that may result in highly dissimilar clusters being merged. For this reason, there may still be some cluster profiles that remain highly similar.

To remove these redundancies, we run merge_clusters() again with dynamic = FALSE and similar parameters:

# merge
merge_output_final <- merge_clusters(tasic_sp, id_table = id_table,
                               cluster_list = clusters_merged.expanded,
                               method = "percentile",
                               dynamic = FALSE,
                               height_cutoff = 0.1,
                               isoform_col = "transcript")
#> New names:
#> * mean_percentile -> mean_percentile...1
#> * mean_percentile -> mean_percentile...2
#> * mean_percentile -> mean_percentile...3
#> * mean_percentile -> mean_percentile...4
#> * mean_percentile -> mean_percentile...5
#> * ...

clusters_final <- merge_output_final[[2]]

# plot final patterns
patterns_final <- map(clusters_final,
                      ~calculate_cluster_profile(tasic_scaled,
                                                 isoform_ids = .,
                                                 id_table = id_table,
                                                 isoform_col = "transcript"))

pattern_plots_final <- map(patterns_final, plot_cluster_profile,
                               ct_labels = ctlabs)

plot_grid(plotlist = pattern_plots_final, 
          labels = seq(1, length(pattern_plots_final)))

Remarkably, there are still two pairs of highly similar clusters that have not been merged by automatic re-clustering, i.e. clusters 6 and 15, and clusters 12 and 16. To solve this and avoid the detection of false-positive coDIU genes due to the presence of redundant profiles, we merged these clusters manually:

# list manual merges
manual_list <- list(c(6, 15),
                    c(12, 16))

# merge clusters 
clusters_final.curated <- c(clusters_final[-unlist(manual_list)],
                            map(manual_list, ~unlist(clusters_final[.])))

# set cluster names
names(clusters_final.curated) <- as.character(seq(1, length(clusters_final.curated)))

9 Keep isoforms from genes with Differential Isoform Usage (DIU)

As a result of clustering, we can next evaluate whether genes that have clustered isoforms are positive for Differential Isoform Usage (DIU). DIU genes must have more than one clustered isoform, and at least two of these isoforms assigned to different clusters. Differential cluster assignment indicates different isoform usage in at least one cell type, and is therefore a straightforward way to call DIU when multiple cell groups are considered.

To detect DIU genes, isoforms must be removed from clusters in the following cases: 1) if these isoforms belong to genes that have a single isoform assigned to clusters and 2) if they belong to genes with two or more clustered isoforms that have no same-gene counterparts in any of the other clusters. Both filters can be applied running the keep_DIU() function in acorde:

# remove isoforms from non-DIU genes
clusters_diu <- keep_DIU(clusters_final.curated, 
                         gene_tr_table = gene_tr_ID)
#> Total no. of clusters: 15
#> Total isoforms in clusters: 6794
#> Isoforms clustered after >1 isoform/gene filter: 6794
#> Isoforms clustered after differential cluster assignment filter: 5278

To obtain a list of all DIU genes that have isoforms in our clusters, we simply need to use the gene_tr_ID table to translate transcript to gene IDs:

# obtain gene ID-based clusters
clusters_diu.gene <- map(clusters_diu, 
                          ~gene_tr_ID[match(., 
                                            gene_tr_ID$transcript),]$gene)

# obtain list of DIU genes
diu_genes <- unlist(clusters_diu.gene) %>% unique

# total number of DIU genes:
length(diu_genes)
#> [1] 2017

In summary, we currently have clustered 5278 isoforms into 15 expression patterns across the 7 cell types in the Tasic dataset, and these isoforms belong to 2017 DIU genes.

10 Detection of genes with co-Differential Isoform Usage (coDIU)

We define co-Differential Isoform Usage (coDIU) as an isoform expression pattern in which a group of genes shows co-expression of their isoforms, but no co-expression is detected when considering only gene-level expression. A coDIU situation between a pair of genes, gene a and gene b, is represented below:

First, we recommend adjusting some global parameters to allow heavy computation to take place (note that the exact value may depend on your system requirements):

options(future.globals.maxSize = 768 * 1024^2)

Then, we will use the find_codiu_genes() function in acorde to generate a list of potentially coDIU gene pairs, that is, genes that have at least two of their isoforms assigned to the same clusters, therefore showing isoform-level co-expression across cell types. We often refer to these as “shared genes”, given that they share isoforms across two or more clusters:

# find shared gene pairs
shared_pairs <- find_codiu_genes(clusters_diu, gene_tr_table = gene_tr_ID)

# show dimensions of results
dim(shared_pairs)

However, clustering can allow expression pattern variability among cluster members, and sometimes isoforms in a cluster might not exactly reflect their expression pattern. Especially when coDIU is detected between two clusters reflecting a closely-related pattern (for instance, similar expression except for one cell type), there may be some false-positives coDIU genes among those detected by find_codiu_genes().

To control for this, acorde provides a statistical test for each potentially coDIU gene pair, i.e. gene 1 and gene 2, for which isoforms were detected in two clusters, i.e. cluster 1 and cluster 2. We here test two different conditions for coDIU: a) that the average profile across cell types of the two isoforms in cluster 1 is significantly different to the average profile of the two isoforms in cluster 2; and b) that the average profile of the two isoforms of gene 1 is not different to the average profile of the two isoforms of gene 2.

For each pair of genes, the test will return two p-values, each corresponding to one of the above-described questions:

  • cluster:cell_type: should be significant if condition a is fulfilled.
  • gene:cell_type: should NOT be significant if condition b is satisifed.

The test is implemented in the test_codiu_genes() function, and can be run as follows (set t and parallel = TRUE -default- for parallel computation):

# test shared gene pairs
codiu_test <- test_codiu_genes(tasic_sp,
                               isoform_col = "transcript_id",
                               cluster_list = clusters_DIU,
                               shared_genes = shared_pairs,
                               gene_tr_table = gene_tr_ID,
                               id_table = id_table,
                               t = 7)

# obtain pvalues
pvalue.df <- map(codiu_test, "pvalues") %>% bind_rows

# adjust p-values
pvalue.df$`cluster:cell_type` <- p.adjust(pvalue.df$`cluster:cell_type`, method = "BH")
pvalue.df$`gene:cell_type` <- p.adjust(pvalue.df$`gene:cell_type`, method = "BH")

Now, we can filter the shared_pairs matrix to only keep genes that satisfy the two conditions for coDIU when tested with at least one other gene:

# filter shared gene list to only keep significant interactions
sig_pairs <- shared_pairs[,pvalue.df$`cluster:cell_type` < 0.05 &
                              pvalue.df$`gene:cell_type` > 0.05]
sig_all <- sig_pairs %>% as.character %>% unique

Finally, using the gene IDs in sig_all, we can now filter the clusters to only include isoforms from significantly coDIU genes, and generate a list of clusters with coDIU gene IDs:

# filter clusters to only include isoforms from coDIU genes
keep_coDIU <- clusters_DIU.gene %>% map(~(. %in% sig_all))
clusters_coDIU <- map2(clusters_DIU, keep_coDIU, ~(.x[.y]))

# convert coDIU clusters to gene IDs
clusters_coDIU.gene <- map(clusters_coDIU,
                           ~gene_tr_ID[match(., gene_tr_ID$transcript),]$gene)

11 Using acorde results for functional analysis

Providing a full description of functional analysis in the acorde manuscript is beyond the scope of this vignette. However, we will provide users with some tips, should they want to do a similar analysis of their single-cell expression clusters using tappAS.

11.1 Obtaining a functionally-annotated transcriptome with IsoAnnotLite

Prior to functional analysis, we transferred functional annotations to our long read-defined isoforms using IsoAnnotLite. Briefly, IsoAnnotLite takes SQANTI3 output files and a previously-annotated tappAS GFF3 file [4] (which can be found here) as inputs to generate a new, tappAS-compatible GFF3 file. To achieve this, transcripts are positionally matched to those in the pre-existing annotation and functional features transferred if they are situated in overlapping genomic positions.

For this study, given that we used a RefSeq transcriptome for long-read isoform definition, we used tappAS’ Mus musculus RefSeq functional annotation (GRCm38, RefSeq release 78). Details for this process are available in Supplementary Note 1 in our manuscript (Arzalluz-Luque et al.).

11.2 Generating tappAS project inputs from single-cell data

Even though tappAS does not currently support single-cell data, our study took advantage of the qualitative analysis modules in the application, i.e.  Functional Diversity Analysis and Functional Enrichment Analysis [4], to obtain insights on the functional features and functionalities that changed as a result of isoform co-expression and coDIU mechanisms.

To load single-cell RNA-Seq data into tappAS, we pretended to have a Time-Course Single Series design using cell IDs as sample IDs and cell types as time points. Please note that this strategy is not valid for quantitative analyses, since tappAS does not implement single-cell dedicated analysis methods.

Using the metadata table, a tappAS design file can be generated as follows:

# create design table from metadata
design <- metadata %>% 
  select(run, cell_type) %>%
  dplyr::rename(sample = "run", time = "cell_type") %>%
  mutate(time = factor(time) %>% as.numeric(), group = rep("case", nrow(metadata))) %>%
  arrange(time)

# subset design to alleviate computational burden
design_sub <- design %>%
  group_by(time) %>%
  slice_sample(n = 10)

# export
write_tsv(design_sub, "tappas_design.tsv")

To generate an expression matrix for tappAS, run:

# create matrix with samples in design file
matrix_sp <- tasic_sp %>%
  select(transcript, all_of(design_sub$sample)) %>% 
  column_to_rownames("transcript")

write.table(matrix_sp, sep = "\t", "tappas_matrix_sp.tsv")

Depending on the target isoform set, users may want to export the DE expression (to use the full dataset) or the muli-isoform count matrices (to only include isoforms input for clustering). This will mostly affect Functional Enrichment Analyses and the definition of background gene lists. For instance, to run a functional enrichment of DIU genes vs genes with at least one DE isoform, users must create a project including transcripts from all genes in that background list, i.e. using the tasic_de expression matrix.

Creating specific test and background gene lists is particularly useful for Functional Enrichment Analysis. Following the example in our study, we generated a specific list with all coDIU genes that included isoforms in oligodendrocyte, neuron and neuron-oligo clusters(i.e. clusters 1, 4 and 14). This list was used as test list for Functional Enrichment of GO terms, and can be generated using the fromList()function (modified from the UpSetR package), which generates a binary occurrence table indicating the clusters in which each coDIU gene has isoforms:

# create occurrence table
gene_occurrence <- acorde::fromList(clusters_coDIU.gene) %>%
  rownames_to_column("gene")

# select clusters
clust_select <- c(1, 4, 14)

# select shared gene group
clust_pair1 <- dplyr::select(gene_occurrence, "1", "4")
gene_group1 <- gene_occurrence$gene[rowSums(clust_pair1) == 2] %>% tibble(gene = .)

clust_pair2 <- dplyr::select(gene_occurrence, "4", "14")
gene_group2 <- gene_occurrence$gene[rowSums(clust_pair2) == 2] %>% tibble(gene = .)

gene_group <- bind_rows(gene_group1, gene_group2) %>% unique

# write file
write_tsv(gene_group,
          "clust_1_4_14_genes.tsv",
          col_names = FALSE)

In addition, a transcript inclusion list can be provided upon tappAS project creation to filter both the expression matrix and annotation files and create a project including only a transcript set of interest. This may be interesting if a particular group of genes is to be analyzed further. Following the same example, this would be the list used to create a project including only isoforms from clusters 1, 4 and 14; in order to perform the Functional Diversity Analysis in our neuron-oligo cluster results section:

# select transcripts from shared genes in clusters
tr_clust.idx <- map(clust_select,
                    ~which(clusters_coDIU.gene[[.]] %in% gene_group$gene))
tr_clust <- map2(clust_select, tr_clust.idx,
                 ~clusters_coDIU[[.x]][.y])

tr_clust_shared <- tibble(transcript = unlist(tr_clust))

# write file
write_tsv(tr_clust_shared,
          "clust_1_4_14_transcripts.tsv",
          col_names = FALSE)

In conclusion, the final output of the acorde pipeline consists in:

  • Isoform clusters representing unique expression patterns across cell types.
  • A list of identifiers of both DIU and coDIU genes, which can then be used for functional characterization.

More information about DIU and coDIU gene characterization and the type of functional insights that can be obtained from the acorde output can be found in Arzalluz-Luque et al..

12 References

If you use acorde in your research, please cite:

  • [1] Angeles Arzalluz-Luque, Pedro Salguero, Sonia Tarazona, Ana Conesa: Acorde: unraveling functionally-interpretable networks of isoform co-usage from single cell data. bioRxiv 2021.05.07.441841 (2021); doi: https://doi.org/10.1101/2021.05.07.441841

Long-read data was obtained from the ENCODE consortium, and made publicly available by Wyman et al. in their bioRxiv preprint:

  • [2] Dana Wyman, Gabriela Balderrama-Gutierrez, Fairlie Reese, Shan Jiang, Sorena Rahmanian, Weihua Zeng, Brian Williams, Diane Trout, Whitney England, Sophie Chu, Robert C. Spitale, Andrea Tenner, Barbara Wold, Ali Mortazavi: A technology-agnostic long-read analysis pipeline for transcriptome discovery and quantification. bioRxiv 672931 (2020); doi: https://doi.org/10.1101/672931

Short-read scRNA-seq data was obtained from Tasic et al. (2016):

  • [3] Tasic, B. et al. Adult mouse cortical cell taxonomy revealed by single cell transcriptomics. Nature Neuroscience. 19, 335–346 (2016).

  • [4] de la Fuente, L., Arzalluz-Luque, Á., Tardáguila, M. et al. tappAS: a comprehensive computational framework for the analysis of the functional impact of differential splicing. Genome Biology 21, 119 (2020). https://doi.org/10.1186/s13059-020-02028-w.